OK, today we'll take the Y Combinator Blog for a small scraping test run. I promise you'll get a real sense of achievement XD
user@ubuntu:/NodeJS/tutorial$ source tutorial/bin/activate
(tutorial) user@ubuntu:/NodeJS/tutorial$
The first line activates the virtual environment; tutorial is the name of your virtual environment, and you can name it anything you like.
Haven't installed virtualenv yet? See the installation tutorial.
(tutorial) user@ubuntu:/NodeJS/tutorial$ python
Python 3.6.3 |Anaconda, Inc.| (default, Oct 13 2017, 12:02:49)
[GCC 7.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import scrapy
>>>
If no error message shows up, the installation is OK.
How do you install scrapy inside the virtual environment? → Run pip install scrapy
Create a new file called articles_spider.py. In a Node.js project we usually put functions that will be required later into the /views folder, so your file structure will now look like this:
- bin/
----- www
- node_modules/
- public/
- routes/
----- index.js
- views/
----- error.ejs
----- index.ejs
+ --- articles_spider.py
- app.js
- package.json
- package-lock.json
- junkfood.json
import scrapy

class ArticlesSpider(scrapy.Spider):
    name = "articles"
    start_urls = [
        'https://blog.ycombinator.com/',
    ]
    ...
    def parse(self, response):
You can select elements with either .css() or .xpath(); I'll demonstrate with .css() here. Everything we want appears inside the .loop-section class, and each piece can be found at the following locations:

<a class="article-title" href="link_here">Title</a>
<a class="author url fn">Author Name</a>
<ul class="post-categories"><li><a>Tags</a></li></ul>

We write these as selectors inside the parse method we just created:
...
    'title': response.css('a.article-title::text').extract_first(),
    'link': response.css('a.article-title::attr("href")').extract_first(),
    'author': response.css('a.author::text').extract_first(),
    'tags': response.css('ul.post-categories > li a::text').extract()
Note 1: response.css('ul > li') selects every li that is a direct child of a ul.
Note 2: ::text grabs only the plain-text content, while ::attr("href") grabs the URL inside the href attribute.
Note 3: extract_first() takes the first item of the matched list; if you don't say which match you want, the selector returns the whole list of HTML objects.
Note 4: response is the whole page content handed to our callback, which is why we run the selects on it.
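If you want to verify these selectors before committing them to code, scrapy shell lets you test them interactively. A quick sketch (the exact output depends on whatever is on the blog's front page when you run it):

(tutorial) user@ubuntu:/NodeJS/tutorial/views$ scrapy shell 'https://blog.ycombinator.com/'
>>> article = response.css('div.loop-section')[0]  # first article block on the page
>>> article.css('a.article-title::text').extract_first()  # first match only
'New Year’s Buying Guide'
>>> article.css('ul.post-categories > li a::text').extract()  # all matches, as a list
['Lists', 'YC News']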
Since each page lists many articles, we loop over every div.loop-section and yield one item per article:

...
    for article in response.css('div.loop-section'):
        yield {
            'title': article.css('a.article-title::text').extract_first(),
            'link': article.css('a.article-title::attr("href")').extract_first(),
            'author': article.css('a.author::text').extract_first(),
            'tags': article.css('ul.post-categories > li a::text').extract()
        }
Then, to crawl beyond page one, grab the "previous posts" link at the bottom of the page and recurse:

...
    next_page = response.css('div.nav-previous a::attr("href")').extract_first()
    if next_page is not None:
        yield response.follow(next_page, self.parse)
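One note on response.follow: it accepts relative URLs and builds the next request for you, but it only exists in Scrapy 1.4 and later. On an older version, an equivalent sketch with scrapy.Request would look like this:

    next_page = response.css('div.nav-previous a::attr("href")').extract_first()
    if next_page is not None:
        # response.urljoin resolves a relative href against the current page's URL
        yield scrapy.Request(response.urljoin(next_page), callback=self.parse)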
Putting it all together, the complete articles_spider.py looks like this:

import scrapy

class ArticlesSpider(scrapy.Spider):
    name = "articles"
    start_urls = [
        'https://blog.ycombinator.com/',
    ]

    def parse(self, response):
        for article in response.css('div.loop-section'):
            yield {
                'title': article.css('a.article-title::text').extract_first(),
                'link': article.css('a.article-title::attr("href")').extract_first(),
                'author': article.css('a.author::text').extract_first(),
                'tags': article.css('ul.post-categories > li a::text').extract()
            }
        next_page = response.css('div.nav-previous a::attr("href")').extract_first()
        if next_page is not None:
            yield response.follow(next_page, self.parse)
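As an aside, you don't have to go through the scrapy CLI shown below; Scrapy's CrawlerProcess can launch the spider from a plain Python script. A minimal sketch, assuming the code above is saved as articles_spider.py in the same folder:

# run_spider.py -- a sketch; the FEED_* settings mirror the -o flag used below
from scrapy.crawler import CrawlerProcess
from articles_spider import ArticlesSpider

process = CrawlerProcess(settings={
    'FEED_FORMAT': 'json',
    'FEED_URI': 'articles.json',
})
process.crawl(ArticlesSpider)
process.start()  # blocks until the whole crawl finishes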
Now run the spider from the views folder and export the results to articles.json:

(tutorial) user@ubuntu:/NodeJS/tutorial/views$ scrapy runspider articles_spider.py -o articles.json

It will run for ten-odd minutes, because the Y Combinator Blog has 294 pages of post listings in total. Open articles.json and you'll see one big, dense index:
[
{"title": "New Year\u2019s Buying Guide", "link": "https://blog.ycombinator.com/b2b-buying-guide/", "author": "Sharon Pope", "tags": ["Lists", "YC News"]},
{"title": "Y Combinator Female Founders Conference 2018", "link": "https://blog.ycombinator.com/y-combinator-female-founders-conference-2018/", "author": "Kat Ma\u00f1alac", "tags": ["Female Founders", "YC News"]},
{"title": "YC Alumni Who Paid It Forward", "link": "https://blog.ycombinator.com/yc-alumni-who-paid-it-forward/", "author": "Michael Seibel", "tags": ["Founder Stories"]},
...
]
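For a quick sanity check on the result, a few lines of Python will do (a sketch; it assumes you run it in the same folder where articles.json was written):

import json

# load the exported feed and print a quick summary
with open('articles.json') as f:
    articles = json.load(f)

print('scraped', len(articles), 'articles')
print('first one:', articles[0]['title'], '-', articles[0]['author'])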
A friendly reminder: don't assume that knowing this much means you know web scraping. The Y Combinator Blog has a very tidy structure, no login wall, and no ajax-loaded content, which is the only reason this first scraping experience was so painless (Y)
In the next post we'll demonstrate, in a Jupyter notebook, how to simulate a login with Python when a site only shows its full data after you sign in.
After that: how do you crack a site that loads its data dynamically via ajax? (That one also happens to be the source of the keyword-phrase recommendations for this little project.)
Finally we'll reach selenium and xvfb: what do you do when the JSON file doesn't even show up in the network panel?
happy scrapy!